Scholarly Infrastructure: Open or Closed?
Peter Murray-Rust*,
University of Cambridge and OpenKnowledge
DRTD-SHS, Lille, FR 2015-04-21
We can build an Open discovery and re-use system.
Theses represent huge untapped communal knowledge.
Bliss was it in that dawn to be alive,
But to be young was very heaven!
Wordsworth on the French Revolution
Scholarly infrastructure becomes closed
No accountability for monitoring and control
The Digital Enlightenment: some of my icons
Diderot, Paris, 1751
Berkeley, US, 1966 Paris, 1968
UK, 1969-73
"How We Stopped SOPA":
This bill ... shut down whole websites. Essentially, it stopped Americans from
communicating entirely with certain groups....
I called all my friends, and we stayed up all night setting up a website for this new group,
Demand Progress, with an online petition opposing this noxious bill.... We [got] ... 300,000
signers.... We met with the staff of members of Congress and pleaded with them.... And then
it passed unanimously....
And then, suddenly, the process stopped. Senator Ron Wyden ... put a hold on the
bill.
He added, "We won this fight because everyone made themselves the hero of their own
story. Everyone took it as their job to save this crucial freedom."
Robert Swartz: "Aaron was killed by the government, and MIT betrayed all of its basic
principles."
Aaron Swartz
Some Children
of the Digital Enlightenment
• David Carroll & Joe McArthur: OAButton
• Rayna Stamboliyska & Pierre-Carl Langlais
• Jon Tennant
• Ross Mounce
• Jenny Molloy
• Erin McKiernan
• Jack Andraka
• Michelle Brook
• Heather Piwowar
• TheContentMine Team
• Rufus Pollock
• Jonathan Gray
• Sophie Kay
Jean-Claude Bradley [1] a chemist
developed Open notebook science;
making the entire primary record of a
research project publicly available
online as it is recorded. (WP)
J-C promoted these ideas with
UNDERGRADUATE scientists.
[1] Unfortunately J-C died in 2014;
we held a memorial meeting in
Cambridge
Sophie
Kay
http://www.nytimes.com/2015/04/08/opinion/yes-we-were-warned-about-ebola.html
We were stunned recently when we stumbled across an article by European
researchers in Annals of Virology [1982]: "The results seem to indicate that
Liberia has to be included in the Ebola virus endemic zone." In the future,
the authors asserted, "medical personnel in Liberian health centers should be
aware of the possibility that they may come across active cases and thus be
prepared to avoid nosocomial epidemics," referring to hospital-acquired
infection.
Adage in public health: "The road to inaction is paved with research
papers."
Bernice Dahn (chief medical officer of Liberia's Ministry of Health)
Vera Mussah (director of county health services)
Cameron Nutt (Ebola response adviser to Partners in Health)
A System Failure of Scholarly Publishing
Open Scholarship must build its own
discovery system before it is too late
Communities of Practice + software:
• Wikip(m)edia
• Open Street Map
• Open Corporates
Theses are under OUR control and hugely valuable.
eTheses
• Citizens pay $20,000,000,000*…
• … for research in 200,000 science theses*…
• … cost $100,000 each to create* …
• … re-use ??? (near zero)
• … Value???
• *Please challenge these numbers…
• NOTE: we pay publishers $15,000,000,000 for
journals and APCs
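A quick check that the starred figures above are at least consistent with each other. The variable names are illustrative, and the inputs are simply the slide's own estimates, which the slide itself invites readers to challenge.

```python
# Figures copied from the slide; they are estimates, not measured data.
theses_per_year = 200_000
cost_per_thesis_usd = 100_000
publisher_payments_usd = 15_000_000_000

# Total cost of the theses, from the per-thesis estimate.
total_thesis_cost_usd = theses_per_year * cost_per_thesis_usd

# How the thesis spend compares with what is paid to publishers.
ratio = total_thesis_cost_usd / publisher_payments_usd

print(f"Theses: ${total_thesis_cost_usd:,}")
print(f"Theses / publisher payments: {ratio:.2f}")
```

On these estimates, the spend on theses is of the same order as the spend on journals and APCs.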
Linked Open Data – the world's knowledge
very little physical science and THESES?? ☹
http://upload.wikimedia.org/wikipedia/commons/3/34/LOD_Cloud_Diagram_as_of_September_2011.png
[Figure: LOD Cloud diagram, September 2011. Labelled regions: DBPedia; BIO; Comp; Lib; PDB; Ontologies; GOV; GOV.uk; Music, Art, Literature; Social; Knowledge bases; RDF triples]
Liberation Software
Steve Coast developed OpenStreetMap
to challenge the monopoly of the UK Ordnance Survey
The Right to Read is the Right to Mine
http://contentmine.org
OUR TEAM
Jenny Molloy @jenny_molloy
Ross Mounce @rmounce
Richard Smith-Unna @blahah404
Stephanie Smith-Unna @treblesteph
Mark MacGillivray @cottagelabs
Peter Murray-Rust @petermurrayrust
Charles Oppenheim @CharlesOppenh
Graham Steel @McDawg
https://en.wikipedia.org/wiki/Irrigation#mediaviewer/File:Pump-enabled_Riverside_Irrigation_in_Comilla,_Bangladesh,_25_April_2014.jpg CC BY-SA 3.0
Daily Stream of 100,000 Open Facts
Twitter? Indexed by CAT
http://catalogue.cottagelabs.com/browse
http://catalogue.cottagelabs.com/graph
Content-Mining (TDM*)
• Now COMPLETELY LEGAL IN UK since 2014-06-01
("Hargreaves")…
• … Whatever the publishers tell you. Do NOT sign
their APIs
• UK can legally IGNORE contractual restrictions
• Movement to extend this to Europe (Julia Reda,
MEP proposal)
• And STM publishers are spending millions to stop
us
*Text and Data Mining
What is "Content"?
http://www.plosone.org/article/fetchObject.action?uri=info:doi/10.1371/journal.pone.0111303&representation=PDF CC-BY
SECTIONS
MAPS
TABLES
CHEMISTRY
TEXT
MATH
contentmine.org tackles these
"nuggets" in a scientific paper
quantity
units
Value ranges
Humans aren't designed to mine this … ☹
chemical
project places
What is "Content"?
Emily Sena (neuroscience.ed.ac.uk) spends
half a day digitising a diagram like this
ContentMine will soon be able to do it in 1 second
• CRAWL the web for scientific documents
(articles, grey literature, repositories)
• quickSCRAPE pages (text, graphics, images, data)
• NORMA-lize page to semantic form
…Open semantic science …
• MINE pages with your methods and tools (AMI)
• CAT-alogue results in searchable index
• Automate daily process (CANARY)
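The stages above can be sketched as a toy pipeline. This is a minimal illustration, not the ContentMine code: the function names `normalise`, `mine` and `catalogue`, the `KNOWN_SPECIES` lookup list and the two in-memory "scraped" pages are all assumptions made for the example, and crawling and scraping are replaced by hard-coded HTML strings.

```python
from html.parser import HTMLParser

# Hypothetical lookup list standing in for a real dictionary of terms.
KNOWN_SPECIES = ["Turdus iliacus", "Pavo cristatus"]


class _TextExtractor(HTMLParser):
    """Collect text nodes and ignore tags (a stand-in for normalisation)."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)


def normalise(html):
    """Reduce an HTML fragment to whitespace-normalised plain text."""
    parser = _TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(" ".join(parser.chunks).split())


def mine(text, terms):
    """Return the lookup terms that occur in the text."""
    return [term for term in terms if term in text]


def catalogue(index, doc_id, facts):
    """Record each fact against the document it was found in."""
    for fact in facts:
        index.setdefault(fact, set()).add(doc_id)
    return index


# Stand-in for the output of the crawl and scrape stages (made-up pages).
scraped = {
    "doc1": "<p>We observed <i>Turdus iliacus</i> and <i>Pavo cristatus</i>.</p>",
    "doc2": "<p>A sample of <i>Pavo cristatus</i> was measured.</p>",
}

index = {}
for doc_id, html in scraped.items():
    text = normalise(html)
    facts = mine(text, KNOWN_SPECIES)
    catalogue(index, doc_id, facts)

print(index)
```

The resulting index maps each extracted term to the documents that mention it, which is the shape a searchable catalogue needs.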
contentmine.org Infrastructure
[Diagram: contentmine.org infrastructure. Labels: Scientific literature; Repositories (CORE Repository UK, HAL repository FR); URL; DOI; Crawl; Feed; quickscrape; Bespoke Scrapers; XPath; Norma; PDF; XML; DOC; CSV; BadHTML; OCR; Diagrams; sHTML; Per-Journal Taggers; Per-Journal Metadata; AMI; Plugins; Regex; Sequences; Species; Chemistry; Phylogenetics; Farming; Index & Transform; Open NORMA-lized Scientific Literature + Facts; CANARY pipeline; CAT-alogue index]
Retrieval/Extraction Technologies
• Bag Of Words (https://en.wikipedia.org/wiki/Bag-of-words_model)
• Term-Frequency Inverse-Document-Frequency
https://en.wikipedia.org/wiki/Tf%E2%80%93idf
• Regular Expressions
• Templates (Information Extraction)
• Natural Language Processing (NLP)
• Image processing and mining
• Lookup (Wikidata, Bioscience databases)
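The first two techniques in the list can be sketched in a few lines of plain Python. This is a minimal illustration using one common TF-IDF variant (relative term frequency times the natural log of inverse document frequency); the function names and the two sample sentences are made up for the example and are not taken from the ContentMine tools.

```python
import math
import re
from collections import Counter


def bag_of_words(text):
    """Lower-case the text, split it into word tokens, and count them."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def tf_idf(docs):
    """Return one {term: weight} dict per document.

    tf  = count of the term in the document / total terms in the document
    idf = log(number of documents / number of documents containing the term)
    """
    bags = [bag_of_words(doc) for doc in docs]
    n_docs = len(bags)
    # Each bag contributes each of its distinct terms once.
    doc_freq = Counter(term for bag in bags for term in bag)
    weights = []
    for bag in bags:
        total = sum(bag.values())
        weights.append({
            term: (count / total) * math.log(n_docs / doc_freq[term])
            for term, count in bag.items()
        })
    return weights


# Made-up snippets standing in for thesis text (not real data).
docs = [
    "the thesis reports the synthesis of a compound",
    "the thesis reports a phylogenetic tree",
]
scores = tf_idf(docs)
```

Terms shared by every document (here "the", "thesis", "reports") get weight zero, while terms that distinguish a document ("synthesis", "phylogenetic") get positive weight.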
Bag of Words
Theses from HAL repository
Species
Regex for Clinical Trials
CLINICAL TRIALS
How do we find (mentions of) clinical trials?
Is a document a (clinical) trial?
What is the subject of the trial?
What is the methodology used? How many/long?
Does the design and practice conform to CONSORT?
What are the outcomes?
Can we extract specific re-usable information?
Who are involved? (researchers, sponsors, patients?)
Has a proposed trial been completed and reported?
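The first question, finding mentions of trials, is where regular expressions help most directly. The sketch below assumes the usual registry identifier shapes (ClinicalTrials.gov: "NCT" plus 8 digits; ISRCTN registry: "ISRCTN" plus 8 digits); the design phrases are an illustrative handful, not a CONSORT checklist, and the sample sentence and its identifier are invented.

```python
import re

# Assumed identifier formats: "NCT" + 8 digits, "ISRCTN" + 8 digits.
TRIAL_ID = re.compile(r"\b(NCT\d{8}|ISRCTN\d{8})\b")

# Illustrative design phrases only; a real tagger would need many more.
DESIGN_TERMS = re.compile(
    r"\b(randomi[sz]ed|double[- ]blind|placebo[- ]controlled)\b",
    re.IGNORECASE,
)


def find_trial_mentions(text):
    """Return registry identifiers and design phrases found in the text."""
    return {
        "ids": sorted(set(TRIAL_ID.findall(text))),
        "design": sorted({m.lower() for m in DESIGN_TERMS.findall(text)}),
    }


# Invented sentence and identifier, for demonstration only.
sample = (
    "This randomised, double-blind trial (NCT01234567) "
    "followed 120 patients for 12 weeks."
)
result = find_trial_mentions(sample)
```

Matching an identifier shows that a trial is mentioned, not that the document reports one; the later questions in the list need templates or NLP rather than regexes alone.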
How a machine reads a chemical thesis
nodes are compounds; arrows are reactions
Natural Language Processing
Part of speech tagging (Wordnet, Brown Corpus, etc.)
Parsing chemical sentences
http://chemicaltagger.ch.cam.ac.uk/
Typical chemical synthesis
Automatic semantic markup of chemistry
Could be used for analytical, crystallization, etc.
Open Content Mining of FACTs
Machines can interpret chemical reactions
We have done 500,000 patents. There are >
3,000,000 reactions/year. Added value > 1B Eur.
AMI https://bitbucket.org/petermr/xhtml2stm/wiki/Home
Example reaction scheme, taken from MDPI Metabolites 2012, 2, 100-133; page 8, CC-BY:
AMI reads the complete diagram,
recognizes the paths and
generates the molecules. Then
she creates a stop-frame animation
showing how the 12 reactions
lead into each other
CLICK HERE FOR ANIMATION
(may be browser dependent)
Evolution of ultraviolet
vision in the largest avian
radiation - the passerines
Anders Ödeen 1*, Olle
Håstad 2,3 and Per Alström 4
PDF ☹
HTML ☺
Styles, superscripts
and diåcritics
preserved!
AMI
PDF ☹
Turdus iliacus
Taeniopygia guttata
Serinus canaria
Lanius excubitor
Melopsittacus undulatus
Pavo cristatus
Sturnus vulgaris
Dolichonyx oryzivorus
Ficedula hypoleuca
Vaccinium myrtillus
Falco tinnunculus
Turdus
Pomatostomus
Leothrix
Amytornis
Acanthisitta
Orthonyx x 2
Malurus
Cnemophilus x 4
Philesturnus x 2
Motacilla x 2
Toxorhampus x 2
Typical phylo tree: 60 nodes, complex and minuscule annotation,
vertical text, hyphenation and valuable branch lengths. AMI extracts ALL
Acanthisittidae
Acanthizidae
Acrocephalidae
Callaeidae
Campephagidae
Cnemophilidae
Corvidae
0.84
0.91
0.93
0.95
Acanthisitta
Acrocephalus
Ailuroedus
Ailuroedus
Amytornis
Camptostoma
AMI
23.12
34.54
37.21
38.55
Posterior
probability
AMI can MEASURE
Branch lengths!
NexML
Genus Family
HTML
https://blogs.ch.cam.ac.uk/pmr/2014/06/25/content-mining-we-can-now-mine-images-of-phylogenetic-trees-and-more/ for story of extraction
Thinning Topology
Serialization
Newick
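Once the topology and branch lengths have been recovered, serialisation to Newick is a short recursive walk. The sketch below is not AMI's code: the `(name, branch_length, children)` tuple layout is an assumption made for the example, and the small tree pairs genus names from the slides with branch lengths chosen purely for illustration.

```python
def to_newick(node):
    """Serialise a nested (name, branch_length, children) tuple as Newick."""
    name, length, children = node
    label = name
    if children:
        # Internal node: children in parentheses, then the (optional) name.
        label = "(" + ",".join(to_newick(child) for child in children) + ")" + name
    if length is not None:
        label += ":" + format(length, "g")
    return label


# Hypothetical tree: the pairing of names and lengths is invented.
tree = ("", None, [
    ("Acanthisitta", 23.12, []),
    ("", 4.0, [
        ("Malurus", 34.54, []),
        ("Turdus", 37.21, []),
    ]),
])

# A Newick string ends with a semicolon.
newick = to_newick(tree) + ";"
print(newick)
```

Any standard phylogenetics tool that reads Newick can then consume the string.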
Problems
• Cannot do handwriting
• Scanned documents give poorer results
• The older the document the poorer the result
• Tables are a major problem
• Always try to get the original document
• XML is better than Word, which is better than PDF
• Vector images >> PNG > JPEG
• Maths, chemistry are specialist
Additional material on Open Notebook
Science (not presented)
Free/Open Software Development
[Diagram labels: Engineered repository; World community; CODE (rewrite, validate, fork, re-use); Github, BitBucket; StackOverflow, Apache; inspires; OSI]
Example: ContentMine at
http://github.com/ContentMine/quickscrape
Sophie Kershaw, Panton Fellow, Training PhD Students
"Do you think you would be
more confident in the future
about trying to apply Open
techniques to your work..?"
• 50% Yes, by myself
• 41% Yes, with help/guidance
• 9% No opinion/neutral
• 0% No
Rotation-Based Learning (RBL)
Phase 1: Initiator
• No communication
permitted between groups
• Attempt to reproduce
existing literature
• Deliver a coherent research
story by the end of Phase 1
Phase 2: Successor
• Communication between
groups still prohibited
• Validate and develop the
inherited research story
• Critique your predecessors
• Role of research producer vs. research user
• Can this approach help to foster awareness of reproducibility issues?
Throughout Phases 1 & 2:
• Daily lectures on open
science culture & techniques
• First-hand application to own
research work
• Version control using GitHub
• Daily group supervision
http://michaelnielsen.org/blog/reinventing-discovery/
http://en.wikipedia.org/wiki/Reinventing_Discovery
TOOLS
Open Notebook Science
[Diagram labels: Open engineered repository; World community; INSTRUMENT; MODEL; CODE; DATA; knowledge; validate; merge; calibrate]
Problems are solved communally;
nothing is needlessly duplicated; "publication" is
continuous
Machines
and humans
Working
together
CC-BY
"Free" and "Open"
• "Free software is a matter of liberty, not price.
'free speech', not 'free beer'". (R M Stallman)
• "A piece of data or content is open if anyone is
free to use, reuse, and redistribute it"
(OKFN) http://opendefinition.org/
• "open" (access) has multiple incompatible "definitions". Major split
is "human eyeballs" vs copying and machine "reusability"
• "Open" is a marketing term for publishers, who frequently (often
deliberately) do not grant full Openness.
"Gratis" vs "Libre"
Critical Historical Open Events
• Free Software Foundation (RMS,
1985) and Linux (Torvalds, 1991)
• The World Wide Web (TBL, 1991)
• The human genome (1990-2001)
The life of Aaron Swartz (1986-2013)
http://www.budapestopenaccessinitiative.org/read
… an unprecedented public good. …
… completely free and unrestricted access to [peer-reviewed
literature] by all scientists, scholars, teachers,
students, and other curious minds. …
… Removing access barriers to this literature will
accelerate research, enrich education, share the
learning of the rich with the poor and the poor with
the rich, make this literature as useful as it can be, and
lay the foundation for uniting humanity in a common
intellectual conversation and quest for knowledge.
(Budapest Open Access Initiative, 2002)
Panton Authors and Fellows

Big Data and ContentMining for LibrariesBig Data and ContentMining for Libraries
Big Data and ContentMining for Libraries
Ā 
The mining "Revolution"; are Libraries supporting Researchers or Publishers"?
The mining "Revolution"; are Libraries supporting Researchers or Publishers"?The mining "Revolution"; are Libraries supporting Researchers or Publishers"?
The mining "Revolution"; are Libraries supporting Researchers or Publishers"?
Ā 
WikiFactMine for Plant Chemistry
WikiFactMine for Plant ChemistryWikiFactMine for Plant Chemistry
WikiFactMine for Plant Chemistry
Ā 
ContentMine: Mining the Scientific Literature
ContentMine: Mining the Scientific LiteratureContentMine: Mining the Scientific Literature
ContentMine: Mining the Scientific Literature
Ā 

Recently uploaded

How to Configure Email Server in Odoo 17
How to Configure Email Server in Odoo 17How to Configure Email Server in Odoo 17
How to Configure Email Server in Odoo 17Celine George
Ā 
AmericanHighSchoolsprezentacijaoskolama.
AmericanHighSchoolsprezentacijaoskolama.AmericanHighSchoolsprezentacijaoskolama.
AmericanHighSchoolsprezentacijaoskolama.arsicmarija21
Ā 
Types of Journalistic Writing Grade 8.pptx
Types of Journalistic Writing Grade 8.pptxTypes of Journalistic Writing Grade 8.pptx
Types of Journalistic Writing Grade 8.pptxEyham Joco
Ā 
call girls in Kamla Market (DELHI) šŸ” >ą¼’9953330565šŸ” genuine Escort Service šŸ”āœ”ļøāœ”ļø
call girls in Kamla Market (DELHI) šŸ” >ą¼’9953330565šŸ” genuine Escort Service šŸ”āœ”ļøāœ”ļøcall girls in Kamla Market (DELHI) šŸ” >ą¼’9953330565šŸ” genuine Escort Service šŸ”āœ”ļøāœ”ļø
call girls in Kamla Market (DELHI) šŸ” >ą¼’9953330565šŸ” genuine Escort Service šŸ”āœ”ļøāœ”ļø9953056974 Low Rate Call Girls In Saket, Delhi NCR
Ā 
Earth Day Presentation wow hello nice great
Earth Day Presentation wow hello nice greatEarth Day Presentation wow hello nice great
Earth Day Presentation wow hello nice greatYousafMalik24
Ā 
MARGINALIZATION (Different learners in Marginalized Group
MARGINALIZATION (Different learners in Marginalized GroupMARGINALIZATION (Different learners in Marginalized Group
MARGINALIZATION (Different learners in Marginalized GroupJonathanParaisoCruz
Ā 
18-04-UA_REPORT_MEDIALITERAŠ”Y_INDEX-DM_23-1-final-eng.pdf
18-04-UA_REPORT_MEDIALITERAŠ”Y_INDEX-DM_23-1-final-eng.pdf18-04-UA_REPORT_MEDIALITERAŠ”Y_INDEX-DM_23-1-final-eng.pdf
18-04-UA_REPORT_MEDIALITERAŠ”Y_INDEX-DM_23-1-final-eng.pdfssuser54595a
Ā 
CELL CYCLE Division Science 8 quarter IV.pptx
CELL CYCLE Division Science 8 quarter IV.pptxCELL CYCLE Division Science 8 quarter IV.pptx
CELL CYCLE Division Science 8 quarter IV.pptxJiesonDelaCerna
Ā 
Blooming Together_ Growing a Community Garden Worksheet.docx
Blooming Together_ Growing a Community Garden Worksheet.docxBlooming Together_ Growing a Community Garden Worksheet.docx
Blooming Together_ Growing a Community Garden Worksheet.docxUnboundStockton
Ā 
What is Model Inheritance in Odoo 17 ERP
What is Model Inheritance in Odoo 17 ERPWhat is Model Inheritance in Odoo 17 ERP
What is Model Inheritance in Odoo 17 ERPCeline George
Ā 
Proudly South Africa powerpoint Thorisha.pptx
Proudly South Africa powerpoint Thorisha.pptxProudly South Africa powerpoint Thorisha.pptx
Proudly South Africa powerpoint Thorisha.pptxthorishapillay1
Ā 
Like-prefer-love -hate+verb+ing & silent letters & citizenship text.pdf
Like-prefer-love -hate+verb+ing & silent letters & citizenship text.pdfLike-prefer-love -hate+verb+ing & silent letters & citizenship text.pdf
Like-prefer-love -hate+verb+ing & silent letters & citizenship text.pdfMr Bounab Samir
Ā 
Procuring digital preservation CAN be quick and painless with our new dynamic...
Procuring digital preservation CAN be quick and painless with our new dynamic...Procuring digital preservation CAN be quick and painless with our new dynamic...
Procuring digital preservation CAN be quick and painless with our new dynamic...Jisc
Ā 
How to Make a Pirate ship Primary Education.pptx
How to Make a Pirate ship Primary Education.pptxHow to Make a Pirate ship Primary Education.pptx
How to Make a Pirate ship Primary Education.pptxmanuelaromero2013
Ā 
Introduction to AI in Higher Education_draft.pptx
Introduction to AI in Higher Education_draft.pptxIntroduction to AI in Higher Education_draft.pptx
Introduction to AI in Higher Education_draft.pptxpboyjonauth
Ā 
Final demo Grade 9 for demo Plan dessert.pptx
Final demo Grade 9 for demo Plan dessert.pptxFinal demo Grade 9 for demo Plan dessert.pptx
Final demo Grade 9 for demo Plan dessert.pptxAvyJaneVismanos
Ā 
Meghan Sutherland In Media Res Media Component
Meghan Sutherland In Media Res Media ComponentMeghan Sutherland In Media Res Media Component
Meghan Sutherland In Media Res Media ComponentInMediaRes1
Ā 
MICROBIOLOGY biochemical test detailed.pptx
MICROBIOLOGY biochemical test detailed.pptxMICROBIOLOGY biochemical test detailed.pptx
MICROBIOLOGY biochemical test detailed.pptxabhijeetpadhi001
Ā 

Recently uploaded (20)

How to Configure Email Server in Odoo 17
How to Configure Email Server in Odoo 17How to Configure Email Server in Odoo 17
How to Configure Email Server in Odoo 17
Ā 
AmericanHighSchoolsprezentacijaoskolama.
AmericanHighSchoolsprezentacijaoskolama.AmericanHighSchoolsprezentacijaoskolama.
AmericanHighSchoolsprezentacijaoskolama.
Ā 
Model Call Girl in Tilak Nagar Delhi reach out to us at šŸ”9953056974šŸ”
Model Call Girl in Tilak Nagar Delhi reach out to us at šŸ”9953056974šŸ”Model Call Girl in Tilak Nagar Delhi reach out to us at šŸ”9953056974šŸ”
Model Call Girl in Tilak Nagar Delhi reach out to us at šŸ”9953056974šŸ”
Ā 
Types of Journalistic Writing Grade 8.pptx
Types of Journalistic Writing Grade 8.pptxTypes of Journalistic Writing Grade 8.pptx
Types of Journalistic Writing Grade 8.pptx
Ā 
call girls in Kamla Market (DELHI) šŸ” >ą¼’9953330565šŸ” genuine Escort Service šŸ”āœ”ļøāœ”ļø
call girls in Kamla Market (DELHI) šŸ” >ą¼’9953330565šŸ” genuine Escort Service šŸ”āœ”ļøāœ”ļøcall girls in Kamla Market (DELHI) šŸ” >ą¼’9953330565šŸ” genuine Escort Service šŸ”āœ”ļøāœ”ļø
call girls in Kamla Market (DELHI) šŸ” >ą¼’9953330565šŸ” genuine Escort Service šŸ”āœ”ļøāœ”ļø
Ā 
Earth Day Presentation wow hello nice great
Earth Day Presentation wow hello nice greatEarth Day Presentation wow hello nice great
Earth Day Presentation wow hello nice great
Ā 
MARGINALIZATION (Different learners in Marginalized Group
MARGINALIZATION (Different learners in Marginalized GroupMARGINALIZATION (Different learners in Marginalized Group
MARGINALIZATION (Different learners in Marginalized Group
Ā 
18-04-UA_REPORT_MEDIALITERAŠ”Y_INDEX-DM_23-1-final-eng.pdf
18-04-UA_REPORT_MEDIALITERAŠ”Y_INDEX-DM_23-1-final-eng.pdf18-04-UA_REPORT_MEDIALITERAŠ”Y_INDEX-DM_23-1-final-eng.pdf
18-04-UA_REPORT_MEDIALITERAŠ”Y_INDEX-DM_23-1-final-eng.pdf
Ā 
TataKelola dan KamSiber Kecerdasan Buatan v022.pdf
TataKelola dan KamSiber Kecerdasan Buatan v022.pdfTataKelola dan KamSiber Kecerdasan Buatan v022.pdf
TataKelola dan KamSiber Kecerdasan Buatan v022.pdf
Ā 
CELL CYCLE Division Science 8 quarter IV.pptx
CELL CYCLE Division Science 8 quarter IV.pptxCELL CYCLE Division Science 8 quarter IV.pptx
CELL CYCLE Division Science 8 quarter IV.pptx
Ā 
Blooming Together_ Growing a Community Garden Worksheet.docx
Blooming Together_ Growing a Community Garden Worksheet.docxBlooming Together_ Growing a Community Garden Worksheet.docx
Blooming Together_ Growing a Community Garden Worksheet.docx
Ā 
What is Model Inheritance in Odoo 17 ERP
What is Model Inheritance in Odoo 17 ERPWhat is Model Inheritance in Odoo 17 ERP
What is Model Inheritance in Odoo 17 ERP
Ā 
Proudly South Africa powerpoint Thorisha.pptx
Proudly South Africa powerpoint Thorisha.pptxProudly South Africa powerpoint Thorisha.pptx
Proudly South Africa powerpoint Thorisha.pptx
Ā 
Like-prefer-love -hate+verb+ing & silent letters & citizenship text.pdf
Like-prefer-love -hate+verb+ing & silent letters & citizenship text.pdfLike-prefer-love -hate+verb+ing & silent letters & citizenship text.pdf
Like-prefer-love -hate+verb+ing & silent letters & citizenship text.pdf
Ā 
Procuring digital preservation CAN be quick and painless with our new dynamic...
Procuring digital preservation CAN be quick and painless with our new dynamic...Procuring digital preservation CAN be quick and painless with our new dynamic...
Procuring digital preservation CAN be quick and painless with our new dynamic...
Ā 
How to Make a Pirate ship Primary Education.pptx
How to Make a Pirate ship Primary Education.pptxHow to Make a Pirate ship Primary Education.pptx
How to Make a Pirate ship Primary Education.pptx
Ā 
Introduction to AI in Higher Education_draft.pptx
Introduction to AI in Higher Education_draft.pptxIntroduction to AI in Higher Education_draft.pptx
Introduction to AI in Higher Education_draft.pptx
Ā 
Final demo Grade 9 for demo Plan dessert.pptx
Final demo Grade 9 for demo Plan dessert.pptxFinal demo Grade 9 for demo Plan dessert.pptx
Final demo Grade 9 for demo Plan dessert.pptx
Ā 
Meghan Sutherland In Media Res Media Component
Meghan Sutherland In Media Res Media ComponentMeghan Sutherland In Media Res Media Component
Meghan Sutherland In Media Res Media Component
Ā 
MICROBIOLOGY biochemical test detailed.pptx
MICROBIOLOGY biochemical test detailed.pptxMICROBIOLOGY biochemical test detailed.pptx
MICROBIOLOGY biochemical test detailed.pptx
Ā 

ContentMine: Liberating scholarship from Open publications and theses

  • 1. Scholarly Infrastructure: Open or Closed? Peter Murray-Rust*, University of Cambridge and OpenKnowledge DRTD-SHS, Lille, FR 2015-04-21 We can build an Open discovery and re-use system. Theses represent huge untapped communal knowledge. Bliss was it in that dawn to be alive, But to be young was very heaven! Wordsworth on the French Revolution
  • 2. Scholarly infrastructure becomes closed No accountability for monitoring and control
  • 3. The Digital Enlightenment: some of my icons Diderot, Paris, 1751 Berkeley, US, 1966 Paris, 1968 UK, 1969-73
  • 4. [“How We Stopped SOPA”: This bill ... shut down whole websites. Essentially, it stopped Americans from communicating entirely with certain groups.... I called all my friends, and we stayed up all night setting up a website for this new group, Demand Progress, with an online petition opposing this noxious bill.... We [got] ... 300,000 signers.... We met with the staff of members of Congress and pleaded with them.... And then it passed unanimously.... And then, suddenly, the process stopped. Senator Ron Wyden ... put a hold on the bill.[48][49] He added, “We won this fight because everyone made themselves the hero of their own story. Everyone took it as their job to save this crucial freedom.” Robert Swartz: "Aaron was killed by the government, and MIT betrayed all of its basic principles."[116] Aaron Swartz
  • 5. Some Children of the Digital Enlightenment • David Carroll & Joe McArthur: OAButton • Rayna Stamboliyska & Pierre-Carl Langlais • Jon Tennant • Ross Mounce • Jenny Molloy • Erin McKiernan • Jack Andraka • Michelle Brook • Heather Piwowar • TheContentMine Team • Rufus Pollock • Jonathan Gray • Sophie Kay Jean-Claude Bradley [1], a chemist, developed Open notebook science: making the entire primary record of a research project publicly available online as it is recorded. (WP) J-C promoted these ideas with UNDERGRADUATE scientists. [1] Unfortunately J-C died in 2014; we held a memorial meeting in Cambridge Sophie Kay
  • 6. http://www.nytimes.com/2015/04/08/opinion/yes-we-were-warned-about-ebola.html We were stunned recently when we stumbled across an article by European researchers in Annals of Virology [1982]: “The results seem to indicate that Liberia has to be included in the Ebola virus endemic zone.” In the future, the authors asserted, “medical personnel in Liberian health centers should be aware of the possibility that they may come across active cases and thus be prepared to avoid nosocomial epidemics,” referring to hospital-acquired infection. Adage in public health: “The road to inaction is paved with research papers.” Bernice Dahn (chief medical officer of Liberia’s Ministry of Health) Vera Mussah (director of county health services) Cameron Nutt (Ebola response adviser to Partners in Health) A System Failure of Scholarly Publishing
  • 7. Open Scholarship must build its own discovery system before it is too late Communities of Practice + software: • Wikip(m)edia • Open Street Map • Open Corporates Theses are under OUR control and hugely valuable.
  • 8. eTheses • Citizens pay $20,000,000,000*… • … for research in 200,000 science theses*… • … cost $100,000 each to create* … • … re-use ??? (near zero) • … Value??? • *Please challenge these numbers… • NOTE: we pay publishers $15,000,000,000 for journals and APCs
  • 9. Linked Open Data – the world’s knowledge very little physical science and THESES?? ☹ http://upload.wikimedia.org/wikipedia/commons/3/34/LOD_Cloud_Diagram_as_of_September_2011.png DBPedia BIO Comp Lib PDB Ontologies GOV GOV.uk Music, Art Literature Social Knowledge bases RDF triples
  • 10. Liberation Software Steve Coast developed OpenStreetMap to challenge the monopoly of the UK Ordnance Survey
  • 11. The Right to Read is the Right to Mine http://contentmine.org
  • 12. OUR TEAM @jenny_molloy Ross Mounce @rmounce Richard Smith-Unna @blahah404 Stephanie Smith-Unna @treblesteph Jenny Molloy Mark MacGillivray @cottagelabs Peter Murray-Rust @petermurrayrust Charles Oppenheim @CharlesOppenh Graham Steel @McDawg
  • 13. https://en.wikipedia.org/wiki/Irrigation#mediaviewer/File:Pump-enabled_Riverside_Irrigation_in_Comilla,_Bangladesh,_25_April_2014.jpg CC BY-SA 3.0 Daily Stream of 100,000 Open Facts Twitter? Indexed by CAT http://catalogue.cottagelabs.com/browse http://catalogue.cottagelabs.com/graph
  • 14. Content-Mining (TDM*) • Now COMPLETELY LEGAL IN UK since 2014-06-01 (“Hargreaves”)… • … Whatever the publishers tell you. Do NOT sign their APIs • UK can legally IGNORE contractual restrictions • Movement to extend this to Europe (Julia Reda, MEP proposal) • And STM publishers are spending millions to stop us *Text and Data Mining
  • 16. “nuggets” in a scientific paper quantity units Value ranges Humans aren’t designed to mine this … ☹ chemical project places
  • 17. What is ā€œContentā€? Emily Sena (neuroscience.ed.ac.uk) spends half a day digitising a diagram like this ContentMine will soon be able to do it in 1 second
  • 18. ā€¢ CRAWL the web for scientific documents (articles, grey literature, repositories) ā€¢ quickSCRAPE pages (text, graphics, images, data) ā€¢ NORMA-lize page to semantic form ā€¦Open semantic science ā€¦ ā€¢ MINE pages with your methods and tools (AMI) ā€¢ CAT-alogue results in searchable index ā€¢ Automate daily process (CANARY) contentmine.org Infrastructure
  • 19. quickscrape Crawl Feed Norma Index & Transform PDF XML URL DOI Scientific literature Repositories DOC CSV sHTML Plugins Regex Sequences Species Bespoke Scrapers XPath Per-Journal Taggers Per-Journal Metadata Chemistry Phylogenetics Farming AMI BadHTML OCR Diagrams Open NORMA-lized Scientific Literature + Facts CANARY pipeline CAT-alogue index
  • 22. Retrieval/Extraction Technologies • Bag Of Words (https://en.wikipedia.org/wiki/Bag-of-words_model) • Term-Frequency Inverse-Document-Frequency (https://en.wikipedia.org/wiki/Tf%E2%80%93idf) • Regular Expressions • Templates (Information Extraction) • Natural Language Processing (NLP) • Image processing and mining • Lookup (Wikidata, Bioscience databases)
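The first two technologies on slide 22 can be illustrated in a few lines. This is a sketch under stated assumptions, not the ContentMine implementation: it uses the plain textbook weighting tf × log(N / df), a naive letters-only tokenizer, and three invented one-line "documents"; the function names are hypothetical.

```python
import math
import re
from collections import Counter

def bag_of_words(text):
    """Lower-case the text and count each word; word order is discarded."""
    return Counter(re.findall(r"[a-z]+", text.lower()))

def tf_idf(docs):
    """Return one {term: weight} dict per document, using tf * log(N / df)."""
    bags = [bag_of_words(doc) for doc in docs]
    n_docs = len(bags)
    # Document frequency: in how many documents does each term occur?
    df = Counter(term for bag in bags for term in bag)
    weights = []
    for bag in bags:
        total = sum(bag.values())
        weights.append({
            term: (count / total) * math.log(n_docs / df[term])
            for term, count in bag.items()
        })
    return weights

# Invented toy documents, not real thesis text.
docs = [
    "ebola virus in liberia",
    "virus evolution in birds",
    "irrigation in bangladesh",
]
scores = tf_idf(docs)
for doc, score in zip(docs, scores):
    print(doc, "->", max(score, key=score.get))
```

A word that occurs in every document ("in") gets weight zero, while a word confined to one document ("ebola") outranks one shared by two ("virus"); that is the property that makes the weighting useful for picking out what a thesis is about.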
  • 23. Bag of Words Theses from HAL repository
  • 26. CLINICAL TRIALS How do we find (mentions of) clinical trials? Is a document a (clinical) trial? What is the subject of the trial? What is the methodology used? How many/long? Does the design and practice conform to CONSORT? What are the outcomes? Can we extract specific re-usable information? Who are involved? (researchers, sponsors, patients?) Has a proposed trial been completed and reported?
  • 27. How a machine reads a chemical thesis nodes are compounds; arrows are reactions
  • 28. Natural Language Processing Part of speech tagging (Wordnet, Brown Corpus, etc.)
  • 31. Automatic semantic markup of chemistry Could be used for analytical, crystallization, etc.
  • 32. Open Content Mining of FACTs Machines can interpret chemical reactions We have done 500,000 patents. There are > 3,000,000 reactions/year. Added value > 1B Eur.
  • 33. AMI https://bitbucket.org/petermr/xhtml2stm/wiki/Home Example reaction scheme, taken from MDPI Metabolites 2012, 2, 100-133; page 8, CC-BY: AMI reads the complete diagram, recognizes the paths and generates the molecules. Then she creates a stop-frame animation showing how the 12 reactions lead into each other CLICK HERE FOR ANIMATION (may be browser dependent)
  • 34. Evolution of ultraviolet vision in the largest avian radiation - the passerines Anders Ödeen 1* , Olle Håstad 2,3 and Per Alström 4 PDF ☹ HTML ☺ Styles, superscripts and diåcritics preserved! AMI
  • 35. PDF ☹ Turdus iliacus Taeniopygia guttata Serinus canaria Lanius excubitor Melopsittacus undulatus Pavo cristatus Sturnus vulgaris Dolichonyx oryzivorus Ficedula hypoleuca Vaccinium myrtillus Falco tinnunculus Turdus Pomatostomus Leothrix Amytornis Acanthisitta Orthonyx x 2 Malurus Cnemophilus x 4 Philesturnus x 2 Motacilla x 2 Toxorhampus x 2
  • 36. Typical phylo tree: 60 nodes, complex and minuscule annotation, vertical text, hyphenation and valuable branch lengths. AMI extracts ALL
  • 39. Problems • Cannot do handwriting • Scanned documents give poorer results • The older the document the poorer the result • Tables are a major problem • Always try to get the original document • XML better than > Word better than > PDF • Vector images >> PNG > JPEG • Maths, chemistry are specialist
  • 40. Additional material on Open Notebook Science (not presented)
  • 41. Free/Open Software Development Engineered repository World community CODE rewrite validate CODE fork CODE Re-use CODE Re-use Github, BitBucket StackOverflow, Apache inspires OSI Example: ContentMine at http://github.com/ContentMine/quickscrape
  • 42. Sophie Kershaw, Panton Fellow, Training PhD Students
  • 43. “Do you think you would be more confident in the future about trying to apply Open techniques to your work..?” • 50% Yes, by myself • 41% Yes, with help/guidance • 9% No opinion/neutral • 0% No
  • 44. Rotation-Based Learning (RBL) Phase 1: Initiator • No communication permitted between groups • Attempt to reproduce existing literature • Deliver a coherent research story by the end of Phase 1 Phase 2: Successor • Communication between groups still prohibited • Validate and develop the inherited research story • Critique your predecessors • Role of research producer vs. research user • Can this approach help to foster awareness of reproducibility issues? Throughout Phases 1 & 2: • Daily lectures on open science culture & techniques • First-hand application to own research work • Version control using GitHub • Daily group supervision
  • 46. TOOLS Open Notebook Science Open engineered repository World community INSTRUMENT validate merge MODEL CODE DATA DATA knowledge calibrate Problems are solved communally; Nothing is needlessly duplicated; “publication” is continuous Machines and humans Working together CC-BY
  • 47. “Free” and “Open” • "Free software is a matter of liberty, not price. 'free speech', not 'free beer'". (R M Stallman) • “A piece of data or content is open if anyone is free to use, reuse, and redistribute it” (OKFN) http://opendefinition.org/ • “open” (access) has multiple incompatible “definitions”. Major split is “human eyeballs” vs copying and machine “reusability” • “Open” is a marketing term for publishers, who frequently (often deliberately) do not grant full Openness. “Gratis” vs “Libre”
  • 48. Critical Historical Open Events • Free Software Foundation (RMS, 1985) and Linux (Torvalds, 1991) • The World Wide Web (TBL, 1991) • The human genome (1990-2001) The life of Aaron Swartz (1986-2013)
  • 49. http://www.budapestopenaccessinitiative.org/read … an unprecedented public good. … … completely free and unrestricted access to [peer-reviewed literature] by all scientists, scholars, teachers, students, and other curious minds. … …Removing access barriers to this literature will accelerate research, enrich education, share the learning of the rich with the poor and the poor with the rich, make this literature as useful as it can be, and lay the foundation for uniting humanity in a common intellectual conversation and quest for knowledge. (Budapest Open Access Initiative, 2002)

Editor's Notes

  1. Hi, I’m here to talk about AMI, a data extraction framework and tool. First, I just want to highlight some of the key contributors to the project: Andy for his work on the ChemistryVisitor and Peter for the overall architecture. In this talk, I’m going to stress the importance of data in a specific format and its utility for automated machine processing. Then I’m going to demonstrate AMI’s architecture and the transformation of data as it flows through the process. I’m going to dwell a little on a core format used, Scalable Vector Graphics (SVG), before introducing the concept of visitors, which are pluggable context-specific data extractors. Next, I’m going to introduce Andy’s ChemVisitor, for extracting semantic chemistry data, along with a few other visitors that can process non-chemistry-specific data. Finally, I will demonstrate some uses of the ChemVisitor, within the realm of validation and metabolism.
  2. Because information is structured (some examples listed), we can aggregate similar objects and mine using a modular systematic approach.